Lexicon Development for Varieties of Spoken Colloquial Arabic

Authors

David Graff

Tim Buckwalter

Mohamed Maamouri

Hubert Jin

Abstract

In Arabic speech communities, there is a diglossic gap between written/formal Modern Standard Arabic (MSA) and spoken/casual colloquial dialectal Arabic (DA): the common spoken language has no standard representation in written form, while the language observed in texts has limited occurrence in speech. Hence the task of developing language resources to describe and model DA speech involves extra work to establish conventions for orthography and grammatical analysis. We describe work being done at the LDC to develop lexicons for DA, comprising pronunciation, morphology and part-of-speech labeling for word forms in recorded speech. Components of the approach are: (a) a two-layer transcription, providing a consonant-skeleton form and a pronunciation form; (b) manual annotation of morphology, part-of-speech and English gloss, followed by development of automatic word parsers modeled on the Buckwalter Morphological Analyzer for MSA; (c) customized user interfaces and supporting tools for all stages of annotation; and (d) a relational database for storing, emending and publishing the transcription corpus as well as the lexicon.

1. The status of Dialectal Arabic

Since the mid-20th century, the term diglossia has been used to describe the sociolinguistic situation in Arabic speech communities (Ferguson, 1959): in each Arabic-speaking region, there are two distinct language systems in use. Modern Standard Arabic (MSA) is the sole system for official communication in government, news reporting and academia. While MSA is used orally in speeches, broadcast news and formal settings, only a small minority of the population has practical experience or facility in speaking it; for most users of MSA, it is like a second language, somewhat related to their primary spoken language, and its use consists mainly, if not exclusively, of reading, writing and listening. Spoken Arabic dialects are the primary languages in these communities, but their usage is almost exclusively oral.
All formal instruction in reading and writing is conducted in and for MSA: being literate means reading and writing MSA.

Differences between MSA and DA involve a variety of diachronic sound changes affecting both manner and place of articulation for several consonants, as well as alterations in derivational and inflectional morphology that may reflect a restructuring of some underlying paradigms. Efforts to establish an explicit standardization for DA orthography and grammar have arisen only recently and are very rare; for the most part, such standardization does not exist. So, even though DA speakers may be familiar with a writing system and a wide range of textual resources, this is only marginally related to their daily usage of speech.

In creating corpora and linguistic annotations for DA, to address the spoken language for purposes of human language technologies, we lack some of the basic underlying resources that are typically available in other literate languages. We must first establish an orthography that will serve as an adequate "canonical" written form, but still preserve evidence of significant pronunciation variants where possible. This matters because, in the absence of an established orthographic standard, and with only limited samples of speech to work from, observed variations could have equal standing as the basis for canonical spellings and further analysis of the language. Next, given the complex morphological structure of DA, we need to develop a lexicon that will support the automation of morphological analysis, by creating a sufficient body of manual annotations. In the process, we need a means to ensure that all annotations can be revisited, amended and refined efficiently and reliably while both transcription and manual analysis are in progress, with suitable feedback to annotators as that work proceeds.

1.1. Issues for MSA-based annotation of DA

The differences between MSA and DA created by diachronic sound changes are significant enough that MSA is unsuitable as a standard orthography for DA. Still, the Arabic script-based writing system is familiar to all literate speakers of DA, and there is a fairly large base of common cognate vocabulary between MSA and DA, making the use of Arabic script, and of orthographic practices that closely resemble those of MSA, an effective means for transcription of DA speech (Maamouri et al., 2004a,b). The keyboard layout for these characters can be learned fairly quickly, and transcribers find it easier to read and verify their typing when it is presented in Arabic script, rather than Latin/ASCII transliteration. It is also useful to distinguish two forms for each word: a "consonant skeleton" form, consistent with standard orthographic practice in MSA, and a "diacritized" form, using the common Arabic diacritic marks (for short vowels, consonant gemination, etc.) to represent pronunciation.

For morphological analysis, the situation is more difficult. Given that DA is directly related to MSA, and we have very good tools for analyzing the morphology of MSA, we first made an attempt to adapt the Buckwalter Morphological Analyzer (Buckwalter, 2004) so that it could provide candidate analyses of DA word forms. The task for annotators, we hoped, would then be simply identifying which of several possible analyses was the appropriate one for a given word. When we tried this approach on a set of transcripts drawn from a corpus of Levantine Arabic telephone conversations (Maamouri et al., 2006), we found that the differences between MSA...
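
The relationship between the two word forms described above (consonant skeleton and diacritized form) can be illustrated with a minimal Python sketch. It assumes, for illustration only, that the skeleton of a word equals its diacritized form with the common Arabic diacritic marks removed; the source does not state that the two layers are always related this simply, and the function names `strip_diacritics` and `layers_consistent` are hypothetical, not part of any LDC tool.

```python
# Assumption: the "common Arabic diacritic marks" are taken here to be the
# Unicode code points U+064B..U+0652 (tanwin, fatha, damma, kasra, shadda,
# sukun) plus U+0670 (superscript alef). The source does not list them.
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}
ARABIC_DIACRITICS.add("\u0670")


def strip_diacritics(word: str) -> str:
    """Return the word with the diacritic marks listed above removed."""
    return "".join(ch for ch in word if ch not in ARABIC_DIACRITICS)


def layers_consistent(skeleton: str, diacritized: str) -> bool:
    """Check that a diacritized form reduces to the given skeleton form.

    Hypothetical consistency check for a two-layer transcription; it only
    holds under the assumption stated above.
    """
    return strip_diacritics(diacritized) == skeleton


if __name__ == "__main__":
    # kaf + fatha, ta + fatha, ba + fatha (a fully diacritized three-letter word)
    diacritized = "\u0643\u064E\u062A\u064E\u0628\u064E"
    skeleton = strip_diacritics(diacritized)
    print(skeleton)
    print(layers_consistent(skeleton, diacritized))
```

A check of this kind could flag transcript tokens whose two layers disagree, so that they can be routed back to a transcriber; cases where the pronunciation form legitimately departs from the skeleton would need a more permissive rule.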


Related papers

Unsupervised Learning of a Chinese Spontaneous and Colloquial Speech Lexicon with Content and Filler Phrase Classification

There is a significant lexical difference, in words and in their usage, between spontaneous/colloquial language and the written language. This difference affects the performance of spoken language recognition systems that use statistical language models or context-free grammars because these models are based on the written language rather than the spoken form. There are many filler phrases and colloqu...


Borrowing the Verb “ast” and Its Varieties in Arabic Dialect of Sarab

“Borrowing” is a linguistic process that is studied in diachronic linguistics. In this process a language borrows elements from another language. It usually occurs in areas where two languages come into contact with each other. Such borrowing occurs in a dialect spoken in South Khorasan. Arabs living in this part of Iran probably immigrated in the early centuries of Islam. In thi...


Adapting Lexical and Language Models for Transcription of Highly Spontaneous Spoken Czech

The paper deals with the problem of automatic transcription of spontaneous conversations in Czech. That type of speech is informal with many colloquial words. It is difficult to create an appropriate lexicon and language model when linguistic resources representing colloquial Czech are limited to several small corpora collected by the Institute of Czech National Corpus. To overcome this, we int...


Cross-lingual acoustic modeling for dialectal Arabic speech recognition

A major problem with dialectal Arabic acoustic modeling is the very sparse speech resources available. In this paper, we have chosen Egyptian Colloquial Arabic (ECA) as a typical dialect. In order to benefit from existing Modern Standard Arabic (MSA) resources, a cross-lingual acoustic modeling approach is proposed that is based on supervised model adaptation. MSA acoustic models were ada...


NileULex: A Phrase and Word Level Sentiment Lexicon for Egyptian and Modern Standard Arabic

This paper presents NileULex, which is an Arabic sentiment lexicon containing close to six thousand Arabic words and compound phrases. Forty-five percent of the terms and expressions in the lexicon are Egyptian or colloquial while fifty-five percent are Modern Standard Arabic. The development of the presented lexicon has taken place over the past two years. While the collection of many of the ...



Publication date: 2006
